1 About the Data

The available data comes from an online shop, which sells beauty products. There are two datasets given: One with data about customer orders and another with data about customer clicks on the website.

The orders dataset consists of one row for each customer order in the time periode from the 28. January 2000 to the 3. March 2000. The data contains values such as the number of product units ordered, the total order amount, payment information, the manufacturer and brand names of the ordered products and social data about the customer, as by example his location or age.

The clicks dataset contains data referring to customer clicks on the website of the given company. Therefore, it is composed of data such as the timestamp of a click, the session length and the name of the clicked product, as well as the product category it belongs to. In this way the whole session course of a customer can be illustrated through the data.

Both datasets share a considerable amount of columns. However, since not every click results in an order and since a session consists normally of more than one click, the contents differ significantly.

2 Task Description

The project’s task is to analyze the dataset, especially by creating plots and statistical tables for the data, that is suspected to be relevant for the online shop in some way.

In addition to that the performance of different Recomendation systems has to be analyzed.

Finally, a model has to be constructed to predict…

3 Cleaning the Data

Before we started cleaning the data, we copied it to the folder “0 Data”. The reason for this was to avoid accidentally altering the original dataset by separating the edited datasets from the original ones. The cleaned data and further forms of the datasets were also saved to this folder. Before explaining the cleaning process, it makes sense to get an overview of the content of the aforementioned folder. It can be described as follows:

  • Each file has a suffix depending on what language was used for creating it. Files created with a Python script have the suffix “_P“, while files created with R have the suffix”_R“.
  • For both datasets three types of files are created:
    • A copy of the original dataset (e.g. “order_data_R.csv”)
    • A cleaned version of that dataset (e.g. “order_data_cleaned_R.csv”)
    • A smaller version of the cleaned data, which allows quick viewing and testing of the technical functionality of the coding (e.g. “order_data_small_R.csv”)

3.1 Cleaning with R

The cleaning process consists of the following steps:

  1. Copy the original datasets to the folder “0 Data”

  2. Read the data, add headers (i.e. column labels), replace “?” and “NULL” with NA, drop columns which have a 100% ratio of missing data, reformat datetime cells and save the result (e.g. as “order_data_cleaned_R.csv”)

  3. Create a subset of the cleaned data, containing only 1000 rows, and save it (e.g. as “order_data_small_R.csv”)

3.2 Cleaning with Python

A similar cleaning process to the one explained above has been implemented in Python.

Note: Python coding chunks are excutable in RMarkdown in general, but the Python environment is not persistent across different python chunks for the preview function ro run coding. Despite this, the chunks are compiled together, when the document is knitted.

3.3 Cleaning differences of Python and R

To test if the cleaning scripts in Python and R result in the same file, execute the following code in RStudio. The package creates a diff view

library(diffobj)
pathOrders <- "0 Data/order_data_cleaned_R.csv"
pathClicks  <- "0 Data/clickstream_data_cleaned_R.csv"
pathOrdersPython <- "0 Data/order_data_cleaned_P.csv"
pathClicksPython  <- "0 Data/clickstream_data_cleaned_P.csv"

orders <- read.csv(file=pathOrders)
clicks <- read.csv(file=pathClicks)
clicksP = read.csv(file=pathClicksPython,na.strings=c("","NA"))
diffPrint(target=clicksP,current=clicks)
ordersP = read.csv(file=pathOrdersPython,na.strings=c("","NA"))
diffPrint(target=ordersP,current=orders)

3.4 Merging

Furthermore, we tried to merge the click and order data in Python by trying different ID combinations that occure in both datasets. For testing the different combinations we used an inner join in order to be able to recognize easier whether a merging try had success. We tried the following combinations for merging the two datasets, which resulted in the shown shapes for the merged dataset:

Clicks Orders Shape
Session ID Order Line Session ID [0, 438]
Session ID Order Session ID [0, 438]
Customer ID Customer ID [6906, 437]
Session Cookie ID Order Line Session ID [0, 438]
Session Cookie ID Order Session ID [0, 438]


In this way we were able to discover that it is possible to join on the ‘Customer ID’ for some instances. Thus, we saved a dataset for the merging results on the Customer ID. But the merged data makes rather less sense since a Customer ID can have multiple order and click rows. Because of this issue, we decided on building a second, smaller data subset containing only the customer information columns of both original datasets. The final merged customer dataset contains 80 attributes for 97 customers.

4 Analyzing the Data

The aim of the data analysis is to extract information, which is suspected to be valuable to the online shop, and prepare it in a way that makes it easily “digestible”. The overview of the information is presented in statistical tables and plots.

4.1 Missing Data

Before creating overview tables or plots for columns, it makes sense to look which columns actually contain a large quantity of information and which do not. To do a check up on the ratio of filled cells, we created a ranking for both datasets containing column names and the percentage of missing data for each column. Columns with a low percentage of missing data are then preferred in later analysis steps. The first 50 entries in the resulting rankings can be seen in the following two tables (left: Orders, right: Clicks).

Column_Name NA_percentage
Order.Line.Date 0.0000
Order.Line.Date_Time 0.0000
Order.Line.Unit.List.Price 0.0000
Order.Line.ID 0.0000
Order.Line.Quantity 0.0000
Order.Line.Unit.Sale.Price 0.0000
Order.Line.Status 0.0000
Order.Line.Tax.Amount 0.0000
Order.Line.Amount 0.0000
Order.Line.Day.of.Week 0.0000
Order.Line.Hour.of.Day 0.0000
City 0.0000
US.State 0.0000
Account.Creation.Date 0.0000
Account.Creation.Date_Time 0.0000
Account.Status 0.0000
Customer.ID 0.0000
Order.Date 0.0000
Order.Date_Time 0.0000
Order.Customer.ID 0.0000
Order.Modification.Date 0.0000
Order.Modification.Date_Time 0.0000
Order.System.Name 0.0000
Order.System.Number 0.0000
Order.ID 0.0000
Order.Source 0.0000
Order.Status 0.0000
Order.Amount 0.0000
Order.Shipping.Amount 0.0000
Order.Tax.Amount 0.0000
Order.Day.of.Week 0.0000
Order.Hour.of.Day 0.0000
Product.Hierarchy.Level 0.0000
Product.Object.ID 0.0000
Product.Object.Status 0.0000
Product.Creation.Date 0.0000
Product.Creation.Date_Time 0.0000
Product.Modification.Date 0.0000
Product.Modification.Date_Time 0.0000
Spend.Over..12.Per.Order.On.Average 0.0000
Order.Discount.Amount 0.0020
ActionCode 0.0035
Order.Line.Assortment.ID 0.0063
Order.Line.Subassortment.ID 0.0063
SendEmail 0.0087
Email 0.0087
Country 0.0283
Order.Num.Payments 0.0447
Legwear_size 0.0494
MyLegsColorRefNo1 0.0511
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
Column_Name NA_percentage
Request.Processing.Time 0.0000
Request.Date 0.0000
Request.Date_Time 0.0000
Request.Sequence 0.0000
Request.Template 0.0000
REQUEST_DAY_OF_WEEK 0.0000
REQUEST_HOUR_OF_DAY 0.0000
Cookie.First.Visit.Date 0.0000
Cookie.First.Visit.Date_Time 0.0000
Session.First.Request.Date 0.0000
Session.First.Request.Date_Time 0.0000
Session.Cookie.ID 0.0000
Session.ID 0.0000
Session.User.Agent 0.0000
Session.Visit.Count 0.0000
Session.First.Processing.Time 0.0000
Session.First.Template 0.0000
Session.First.Request.Day.of.Week 0.0000
Session.First.Request.Hour.of.Day 0.0000
Session.First.Content.ID 0.0010
Request.Assortment.ID 0.0017
Content.Level 0.0067
Content.Creation.Date 0.0067
Content.Creation.Date_Time 0.0067
Content.ID 0.0067
Content.Modification.Date 0.0067
Content.Modification.Date_Time 0.0067
Content.Status 0.0067
Content.Type 0.0067
Content.Level.1.Path 0.0067
Content.Level.2.Path 0.0067
Content.Level.3.Path 0.0067
Assortment.Type 0.1764
Assortment.ID 0.1764
Assortment.Status 0.1764
Assortment.Creation.Date 0.1764
Assortment.Creation.Date_Time 0.1764
Assortment.Modification.Date 0.1764
Assortment.Modification.Date_Time 0.1764
Assortment.Level 0.1764
Assortment.Level.1.Path 0.1764
Assortment.Level.2.Path 0.1764
Request.Query.String 0.2248
Request.Referrer 0.2376
Request.Subassortment.ID 0.3006
Session.First.Referrer 0.4104
Assortment.Level.3.Path 0.4753
Assortment.Level.4.Path 0.6609
Session.First.Query.String 0.6713
VendorPageColor 0.7896

4.2 Structure and Content

4.2.1 Order Data

The order data can be mainly devided into 4 sections:

  1. Customer Data: For this section we regard all information referring to the customer as an individual. This data contains information such as the customer gender, location, family status and retail activities.

    • City
    • Country
    • US.State
    • Age
    • Marital.Status
    • Gender
    • Audience
    • Truck.Owner
    • RV.Owner
    • Motorcycle.Owner
    • Working.Woman
    • Presence.Of.Children
    • Speciality.Store.Retail
    • Oil.Retail.Activity
    • Bank.Retail.Activity
    • Finance.Retail.Activity
    • Miscellaneous.Retail.Activity
    • Upscale.Retail
    • Upscale.Speciality.Retail
    • Retail.Activity
  2. Product Data: The following data columns describe features of the ordered products.

    • StockType
    • Manufacturer
    • BrandName
  3. Payment Data: This sections contains columns which describe the payment methods used by the customers.

    • Order.Credit.Card.Brand
    • Bank.Card.Holder
    • Gas.Card.Holder
    • Upscale.Card.Holder
    • Unknown.Card.Type
    • TE.Card.Holder
    • Premium.Card.Holder
    • New.Bank.Card
  4. Order Data: The order data section contains information describing the order process itself, such as order quantity and price data.

    • Order.Line.Quantity
    • Order.Line.Unit.List.Price
    • Order.Line.Amount
    • Spend.Over..12.Per.Order.On.Average
    • Order.Line.Day.of.Week
    • Order.Line.Hour.of.Day
    • Order.Promotion.Code
    • Order.Discount.Amount

4.2.2 Clickstream Data

The clickstream data has three main categories:

  1. Customer Data: The data contains a vast collection of information about customers, reaching from usual informations like age, gender, etc. over payment information and finacnial activities to opinions about the shop.

    • WhichDoYouWearMostFrequent
    • YourFavoriteLegcareBrand
    • Registration.Gender
    • NumberOfChildren
    • DoYouPurchaseForOthers
    • HowDoYouDressForWork
    • HowManyPairsDoYouPurchase
    • YourFavoriteLegwearBrand
    • WhoMakesPurchasesForYou
    • NumberOfAdults
    • HowDidYouHearAboutUs
    • SendEmail
    • HowOftenDoYouPurchase
    • HowDidYouFindUs
    • City
    • US.State
    • Year.of.Birth
    • Email
    • Truck.Owner
    • RV.Owner
    • Motorcycle.Owner
    • Value.Of.All.Vehicles
    • Age
    • Other.Indiv…Age
    • Marital.Status
    • Working.Woman
    • Mail.Responder
    • Bank.Card.Holder
    • Gas.Card.Holder
    • Upscale.Card.Holder
    • Unknown.Card.Type
    • TE.Card.Holder
    • Premium.Card.Holder
    • Presence.Of.Children
    • Number.Of.Adults
    • Estimated.Income.Code
    • Home.Market.Value
    • New.Car.Buyer
    • Vehicle.Lifestyle
    • Property.Type
    • Loan.To.Value.Percent
    • Presence.Of.Pool
    • Year.House.Was.Built
    • Own.Or.Rent.Home
    • Length.Of.Residence
    • Mail.Order.Buyer
    • Year.Home.Was.Bought
    • Home.Purchase.Date
    • Number.Of.Vehicles
    • DMA.No.Mail.Solicitation.Flag
    • DMA.No.Phone.Solicitation.Flag
    • CRA.Income.Classification
    • New.Bank.Card
    • Number.Of.Credit.Lines
    • Speciality.Store.Retail
    • Oil.Retail.Activity
    • Bank.Retail.Activity
    • Finance.Retail.Activity
    • Miscellaneous.Retail.Activity
    • Upscale.Retail
    • Upscale.Speciality.Retail
    • Retail.Activity
    • Dwelling.Size
    • Dataquick.Market.Code
    • Lendable.Home.Equity
    • Home.Size.Range
    • Lot.Size.Range
    • Insurance.Expiry.Month
    • Dwelling.Unit.Size
    • Month.Home.Was.Bought
    • Available.Home.Equity
    • Minority.Census.Tract
    • Year.Of.Structure
    • Gender
    • Occupation
    • Other.Indiv…Gender
    • Other.Indiv…Occupation
  2. Product Data: The following data columns describe features of products clicked on by the customers.

    • BrandName
    • UnitsPerInnerBox
    • PrimaryPackage
    • Depth
    • VendorMinREOrderDollars
    • Height
    • UnitsPerOuterBox
    • StockType
    • Pack
    • ProductForm
    • Look
    • BasicOrFashion
    • MfgStyleCode
    • SaleOrNonSale
    • Length
    • MinQty
    • LeadTime
    • Weight
    • HasDressingRoom
    • ColorOrScent
    • Width
    • Texture
    • Manufacturer
    • ToeFeature
    • Category2
    • Material
    • CategoryCode
    • UnitIncrement
    • WaistControl
    • Collection
    • BodyFeature
    • Audience
    • Category1
    • Product
    • Pattern
  3. Time Data: The following data columns describe different dates and times for clicks.

    • Request.Date
    • Request.Date_Time
    • Request.Sequence
    • REQUEST_DAY_OF_WEEK
    • REQUEST_HOUR_OF_DAY
    • Cookie.First.Visit.Date
    • Cookie.First.Visit.Date_Time
    • Session.First.Request.Date
    • Session.First.Request.Date_Time
    • Session.Start.Login.Count
    • Session.Cookie.ID
    • Session.ID
    • Session.Customer.ID
    • Session.Visit.Count
    • Session.First.Content.ID
    • Session.First.Request.Day.of.Week
    • Session.First.Request.Hour.of.Day

4.3 Statistical Tables

Given a subset of interesting columns, we create two types of statistical tables for each: One table for numerical columns in the subset and another for factors. The statistical table for the numerical data contains the maximum value, minimum value, mean, median and standard deviation for each column. Whereas the factorial tables contain the five most frequent factors as well as their percentage, the ratio of NAs and other factors for each column. For this, the NA percentage gets calculated at first, then the NA values are deleted from the regarded column and the percentage for each value is calcuated.

4.3.1 Order Data

In the following section the statistical tables generated for the purpose of describing the order data are shown. Additionally, the most important or interesting analysis results are emphasized and shortly explained.

Note: To support a better visualization, more relevant columns are highligthed in black color.

4.3.1.1 Customer Data

Only the age can be regarded as a numerical customer data column here. As most intersting in this summary might appear the mean and the median, which both imply an average customer segment consisting of people in their late 30s.

Variable Max Mean Median Min SD
Age 98 38.37 36 18 10.87

The following data summary shows some social data for the shop’s customer segment. Since all of the available data for the country column contains the value ‘United States’, it is highly probable that the online shop exclusively delivers customer located in the US. This was the reason for us to choose a map of the United States in order to visualize the customers’ locations later on in the plotting. Furthermore, the data clearly shows that the main customer audience targeted are women.

Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
City New York: 4.53% San Francisco: 2.05% Stamford: 1.24% Austin: 1.13% Brooklyn: 0.98% 90.07% 0%
Country United States: 100% 0% 2.83%
US.State CA: 14.63% NY: 14.11% TX: 6.93% PA: 5.8% CT: 5.28% 53.25% 0%
Marital.Status Married: 66.13% Single: 22.02% Inferred Single: 7.15% Inferred Married: 4.7% 0% 34.98%
Gender Female: 83.06% Male: 16.94% 0% 44.96%
Audience Women: 81.17% Men: 12.5% Children: 6.33% 0% 11.08%
Truck.Owner False: 78.55% True: 21.45% 0% 22.22%
RV.Owner False: 91.5% True: 8.5% 0% 22.22%
Motorcycle.Owner False: 98.66% True: 1.34% 0% 22.22%
Working.Woman False: 68.79% True: 31.21% 0% 22.22%
Presence.Of.Children False: 54.66% True: 45.34% 0% 22.22%
Speciality.Store.Retail False: 84.12% True: 15.88% 0% 22.22%
Oil.Retail.Activity False: 91.8% True: 8.2% 0% 22.22%
Bank.Retail.Activity False: 75.44% True: 24.56% 0% 22.22%
Finance.Retail.Activity False: 91.69% True: 8.31% 0% 22.22%
Miscellaneous.Retail.Activity False: 94.88% True: 5.12% 0% 22.22%
Upscale.Retail False: 94.25% True: 5.75% 0% 22.22%
Upscale.Speciality.Retail False: 96.44% True: 3.56% 0% 22.22%
Retail.Activity False: 60% True: 40% 0% 22.22%

4.3.1.2 Product Data

The selected columns belonging to the product information section show only factorial values. The statistical overview for the product data reveals that most of the sold articles are replenishable. The strongest brand in the current orders is American Essential, which seems to manufacture its articles by itself. Important is to recognize that the presented data is biased in a way: Because the given dataset shows only ordered products, it can be assumed that the popularity of articles distorts all percentual information referring to the products. Thus, we can not make any assumptions referring to the original product palette the store offers.

Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
StockType Replenishable: 69.58% Seasonal 1: 23.08% Replenishment: 5.14% Seasonal 1*: 1.83% Seasonal 2: 0.24% 0% 14.72%
Manufacturer American Essentials: 20.91% Ridgeview: 16.64% HAN: 13.16% Donna Karan Company: 10.85% HOSO: 10.67% Peneco: 8.98% 0%
BrandName AME: 22.07% HOSO: 11.26% ELT: 10.81% Silk Reflections: 9.35% DAN: 7.92% 27.51% 11.08%

4.3.1.3 Payment Data

When it comes to the data concerning the used payment methods, there are only factorial columns as well. The most used credit card is by far the VISA card. Furthermore almost a fifth of the customers uses a premium card. From this information it could be deduced how wealthy the customer segment is by comparing the ratio of premium cards to the one in the whole population.

Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
Order.Credit.Card.Brand VISA: 59.94% MC: 25.43% AMEX: 14.31% DISC: 0.31% NA 0% 16.71%
Bank.Card.Holder True: 86.57% False: 13.43% NA 0% 22.22%
Gas.Card.Holder True: 75.81% False: 24.19% NA 0% 22.22%
Upscale.Card.Holder True: 54.1% False: 45.9% NA 0% 22.22%
Unknown.Card.Type False: 56.18% True: 43.82% NA 0% 22.22%
TE.Card.Holder False: 89.42% True: 10.58% NA 0% 22.22%
Premium.Card.Holder False: 75.88% True: 24.12% NA 0% 22.22%
New.Bank.Card False: 99.55% True: 0.45% NA 0% 22.22%

4.3.1.4 Order Process Data

The numerical data for the order process shows that a customer usally buys one product per order. Also the order line amount implies that the store offers rather inexpensive articles. Furthermore, the minimum value for both the order line quantity and the order line amount is negative, which hints to the assumption of eather the order data containing returns as well or having errors in it. Through the discount amount it is possible to state that the store offers a maximum of a 50% price reduction for the given time period. Also the buyed articles got by average a discount of about 9%. Again it can be assumed that this data is not representative for the shop’s offer in general, because it is probable that articles with a higher discount are bought more often.

Variable Max Mean Median Min SD
Order.Line.Quantity 18 1.31 1.0 -2 0.95
Order.Line.Unit.List.Price 72 9.26 7.5 0 6.46
Order.Line.Amount 234 11.62 10.0 -40 11.51
Order.Line.Hour.of.Day 23 13.04 13.0 0 5.29
Order.Discount.Amount 50 8.82 10.0 0 9.98
The factors for the order process demonstrate that the ‘FRIEND’ discount is used most often and in the majority of the orders. (At this point it would be relevant for the interpretation to know for whom and under which circumstances this discount is given.) Also the weekday summary could be relevant for sales purposes, by example for finding out the most successfull time for showing ads to possible customers.
Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
Spend.Over..12.Per.Order.On.Average False: 64.04% True: 35.96% 0% 0%
Order.Line.Day.of.Week Wednesday: 26.96% Tuesday: 17.72% Thursday: 17.37% Friday: 16.48% Saturday: 8.43% 13.04% 0%
Order.Promotion.Code FRIEND: 82.09% SPRING: 2.14% MARCH1: 1.92% FREE: 1.39% 4128003160593466: 1.13% 0% 23.15%

4.3.2 Clickstream Data

4.3.2.1 Customer Data

tbd

Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
WhichDoYouWearMostFrequent casual socks: 37.12% hosiery: 33.89% athletic socks: 17.03% trouser socks: 11.96% 0% 95.38%
YourFavoriteLegcareBrand Nature Made: 13.58% Conair: 13.51% Epilady: 9.33% Lucky Chick: 9.13% DailyHerbs: 6.41% 0% 96.6%
Registration.Gender Female: 59.52% Male: 40.48% 0% 99.85%
NumberOfChildren 0: 54.76% 2: 18.25% 4 or more: 13.89% 1: 13.1% 0% 99.85%
DoYouPurchaseForOthers False: 100% 0% 86.14%
HowDoYouDressForWork business casual: 42.43% very casual: 26.26% business dress: 16.97% comfortable / athletic: 14.34% 0% 95.46%
HowManyPairsDoYouPurchase 15 or more: 55.78% 11 to 15: 25.65% 1 to 5: 13.01% 6 to 10: 5.56% 0% 95.59%
YourFavoriteLegwearBrand Hanes: 36.85% DKNY: 16.21% Danskin: 8.67% Donna Karan: 8.22% Berkshire: 6.6% 0% 95.42%
WhoMakesPurchasesForYou spouse: 41.29% friend: 30.15% parent: 28.22% siblings: 0.35% 0% 99.12%
NumberOfAdults 2: 35.71% 3 or more: 34.13% 1: 30.16% 0% 99.85%
HowDidYouHearAboutUs other: 38.81% friend / family: 31.82% e-mail: 16.02% print ad: 11.58% in the news: 1.22% 0% 87.9%
SendEmail True: 66.3% False: 33.7% 0% 85.98%
HowOftenDoYouPurchase every 6 months: 77.96% once a year: 14.1% each week: 7.94% 0% 96.38%
HowDidYouFindUs Friend/Co-worker: 59.92% Other: 26.98% News Story: 13.1% 0% 99.85%
City New York: 2.36% Dallas: 1.35% San Francisco: 1.31% Stamford;: 1.22% stamford: 1.01% 6.77% 85.98%
US.State NY: 12.7% CA: 12.55% TX: 6.37% CT: 5.83% PA: 5.17% 0% 85.98%
Email COM: 72.94% NET: 20.09% EDU: 2.57% Other: 2.07% Gazelle: 1.81% 0% 85.98%
Truck.Owner False: 79.26% True: 20.74% 0% 89.44%
RV.Owner False: 92.23% True: 7.77% 0% 89.44%
Motorcycle.Owner False: 98.83% True: 1.17% 0% 89.44%
Marital.Status Married: 59.26% Single: 27.16% Inferred Married: 6.83% Inferred Single: 6.75% 0% 90.97%
Working.Woman False: 67.26% True: 32.74% 0% 89.44%
Mail.Responder True: 73.6% False: 26.4% 0% 89.44%
Bank.Card.Holder True: 81.44% False: 18.56% 0% 89.44%
Gas.Card.Holder True: 68.95% False: 31.05% 0% 89.44%
Upscale.Card.Holder False: 50.88% True: 49.12% 0% 89.44%
Unknown.Card.Type False: 59.71% True: 40.29% 0% 89.44%
TE.Card.Holder False: 90.67% True: 9.33% 0% 89.44%
Premium.Card.Holder False: 79.92% True: 20.08% 0% 89.44%
Presence.Of.Children False: 53.25% True: 46.75% 0% 89.44%
Estimated.Income.Code $50;000-$74;999: 21.33% $75;000-$99;999: 18.36% $40;000-$49;999: 13.19% $125;000 OR MORE: 10.72% $30;000-$39;999: 10.43% 0% 89.71%
Home.Market.Value $50;000-$74;999: 16.65% $100;000-$124;999: 13.96% $75;000-$99;999: 13.33% $150;000-$174;999: 8.32% $125;000-$149;999: 8.16% 0% 92.22%
New.Car.Buyer True: 100% 0% 95.32%
Vehicle.Lifestyle IMPORT (STANDARD/ECONOMY): 26.06% FULL SIZE (STANDARD/LUXURY): 24.07% TRUCK OR UTILITY VEHICLE: 12.16% SPECIALTY (MIDSIZE/SMALL): 11.19% STATION WAGON: 9.65% 0% 95.48%
Property.Type single family dwelling: 84.34% condo: 10.98% 2-4 unit(duplex;triplex;quad): 2.23% misc. residential (condo store/flat): 1.69% apartment(5+ units): 0.68% 0% 96.15%
Loan.To.Value.Percent 0% (NO LOANS): 29.12% 70-74%: 9.97% 100-99%: 9.22% 60-69%: 9.04% 01-49%: 7.9% 0% 97.33%
Presence.Of.Pool False: 98.25% True: 1.75% 0% 89.44%
Own.Or.Rent.Home Owner: 94.76% Renter: 5.24% 0% 90.68%
Mail.Order.Buyer True: 61.3% False: 38.7% 0% 89.44%
DMA.No.Mail.Solicitation.Flag True: 100% 0% 89.44%
DMA.No.Phone.Solicitation.Flag True: 100% 0% 89.44%
New.Bank.Card False: 100% 0% 89.44%
Speciality.Store.Retail False: 86.93% True: 13.07% 0% 89.44%
Oil.Retail.Activity False: 90.19% True: 9.81% 0% 89.44%
Bank.Retail.Activity False: 78.38% True: 21.62% 0% 89.44%
Finance.Retail.Activity False: 91.41% True: 8.59% 0% 89.44%
Miscellaneous.Retail.Activity False: 94.05% True: 5.95% 0% 89.44%
Upscale.Retail False: 94.59% True: 5.41% 0% 89.44%
Upscale.Speciality.Retail False: 97.06% True: 2.94% 0% 89.44%
Retail.Activity False: 64.38% True: 35.62% 0% 89.44%
Dwelling.Size SINGLE HOUSEHOLD: 74.42% 2 HOUSEHOLDS: 7.48% 100+ HOUSEHOLDS: 3.4% 10-19 HOUSEHOLDS: 2.65% 30-39 HOUSEHOLDS: 2.17% 0% 90.16%
Lendable.Home.Equity EQUITY LESS THAN OR EQUAL $0: 29.99% EQUITY $10;000-$19;9999: 11.73% EQUITY $30;000-$49;000: 10.5% EQUITY $75;000-$99;999: 8.58% EQUITY $100;000-$149;999: 7.92% 0% 97.33%
Home.Size.Range 1;250-1;499 FT: 17.76% 1;000-1;249 FT: 15.73% 750-999 FT: 12.85% 2;000-2;499 FT: 12.51% 1;500-1;749 FT: 10.59% 0% 97.13%
Lot.Size.Range 1 ACRE OR LESS: 88.42% GREATER THAN 1 ACRE: 11.58% 0% 97.51%
Dwelling.Unit.Size SINGLE FAMILY DWELLING UNIT: 74.4% MULTI FAMILY DWELLING UNIT: 25.6% 0% 89.9%
Available.Home.Equity EQUITY $30;000-$49;000: 19.28% EQUITY $50;000-$74;000: 18.7% EQUITY $100;000-$149;999: 12.29% EQUITY $20;000-$29;000: 11.46% EQUITY $75;000-$99;999: 10.96% 0% 91.8%
Minority.Census.Tract False: 99.04% True: 0.96% 0% 89.44%
Gender Female: 82.35% Male: 17.65% 0% 92.91%
Occupation PROFESSIONAL/TECHNICAL: 27.42% HOUSEWIFE: 16.64% ADMINISTRATIVE/MANAGERIAL: 14.35% CLERICAL/WHITE COLLAR: 11.61% STUDENT: 6.24% 0% 96.62%
Other.Indiv…Gender Male: 79.08% Female: 20.92% 0% 94.8%
Other.Indiv…Occupation PROFESSIONAL/TECHNICAL: 44.18% ADMINISTRATIVE/MANAGERIAL: 18% CRAFTSMAN/BLUE COLLAR: 14.53% SALES/SERVICE: 5.2% CLERICAL/WHITE COLLAR: 5.01% 0% 97.07%
Variable Max Mean Median Min SD
Year.of.Birth 1979 1963.63 1962 1948 8.88
Value.Of.All.Vehicles 99000 18863.13 16000 1000 13401.19
Age 86 37.58 36 18 10.56
Other.Indiv…Age 86 40.37 38 18 10.56
Number.Of.Adults 6 2.52 2 1 1.35
Year.House.Was.Built 1997 1965.16 1969 1850 25.23
Length.Of.Residence 15 6.77 6 0 4.38
Year.Home.Was.Bought 1999 1991.18 1993 1954 6.34
Home.Purchase.Date 199906 199121.02 199300 195400 633.68
Number.Of.Vehicles 3 1.45 1 1 0.59
CRA.Income.Classification 4 3.33 3 1 0.64
Number.Of.Credit.Lines 9 2.60 3 1 1.56
Dataquick.Market.Code 10 4.45 4 1 2.45
Insurance.Expiry.Month 12 6.51 6 1 3.46
Month.Home.Was.Bought 12 7.12 7 1 3.28
Year.Of.Structure 1999 1972.74 1981 1900 27.69

4.3.2.2 Product Data

tbd

Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
BrandName DKNY: 9.99% Silk Reflections: 9.21% ORO: 8.85% HPK: 7.22% AME: 7.13% 0% 86.16%
PrimaryPackage Bottle: 35.18% Tube: 30.46% Jar: 25.55% Box: 5.4% Spray: 3.41% 0% 96.07%
StockType Replenishable: 60.93% Seasonal 1: 23.75% Seasonal 1*: 11.25% Seasonal 2: 2.07% Replenishment: 1.83% 0% 80.3%
ProductForm Cream: 54.01% Liquid: 24.19% gel: 7.58% Lotion: 7.36% Capsule: 5.97% 0% 96.64%
Look Sheer: 83.41% Ultra Sheer: 13.1% Opaque: 3.48% 0% 94.22%
BasicOrFashion Basic: 92.33% Fashion: 7.67% 0% 86.11%
MfgStyleCode Tricot: 2.34% BC27340: 1.45% 00N02: 1.35% 00Q63: 1.34% 5751: 1.2% 10.26% 82.06%
SaleOrNonSale NSALE: 100% 0% 94.46%
HasDressingRoom False: 73.56% True: 26.44% 0% 86.09%
ColorOrScent Scent: 85.69% Color: 14.31% 0% 99.7%
Texture Flat: 66.48% Textured: 33.52% 0% 96.65%
Manufacturer Donna Karan Company: 10.79% Peneco: 9.08% HAN: 8.66% Kneipp: 6.61% Paul Lavitt Mills Inc.: 6.54% 0% 80.12%
ToeFeature SF: 86.04% RT: 13.96% 0% 93.49%
Category2 Gift Sets & Special Items: 32.01% Skincare: 23.28% Cellulite & Other Treatments: 22.82% Footcare: 14.47% Health Supplements: 6.2% 0% 99.21%
Material Cotton: 66.35% Nylon: 23.62% Coolmax: 3.71% Rayon: 1.49% Lycra: 1.04% 0% 93.18%
CategoryCode PH: 33.89% WDCS: 12.66% TH: 7.83% FO: 6.76% TT: 6.2% 0% 86.09%
WaistControl CT: 76.17% STW: 23.83% 0% 94.34%
Collection Oroblu Italian Hosiery: 6.31% Conversationals: 5.17% DKNY Skin: 4.41% Action Pack 3-Pair: 3.9% Womens Dance: 3.84% 0% 86.96%
BodyFeature MBC: 64.5% UBC: 17.51% LBC: 11.17% BS: 6.82% 0% 98.49%
Audience Women: 80.86% Men: 10.21% Children: 8.93% 0% 86.09%
Category1 Skincare: 60.62% Footcare: 18.05% Cellulite & Other Treatments: 15% Hair Removal: 3.89% Health Supplements: 1.83% 0% 94%
Product Cellulite Trimming Gel: 3.25% Body Lotion - Oceanic Minerals: 2.74% Kit-Firming Cream/Slimming Cream/Shorts: 2.57% Body Silk: 2.41% Herbal Foot Balm: 2.31% 0% 94%
Pattern Solid: 58% Conversational: 39.12% Floral: 2.03% Stripe: 0.55% Herringbone: 0.18% 0% 93.72%
Variable Max Mean Median Min SD
UnitsPerInnerBox 12.0 4.44 3.00 1.00 2.91
Depth 16.0 2.73 2.50 0.50 2.15
VendorMinREOrderDollars 500.0 161.15 150.00 100.00 82.29
Height 8.5 1.18 0.75 0.25 1.20
UnitsPerOuterBox 144.0 18.57 12.00 4.00 16.83
Pack 3.0 1.14 1.00 1.00 0.50
Length 16.5 9.12 9.25 3.50 1.65
MinQty 144.0 16.48 6.00 0.00 30.45
LeadTime 28.0 10.95 10.00 1.00 6.41
Weight 40.0 4.98 2.60 0.40 5.59
Width 18.0 5.68 6.25 0.50 1.93
UnitIncrement 36.0 4.81 3.00 1.00 4.38

4.4 Plotting

Some information is too complex to be compressed into a single table without making it too confusing or it’s simply easier to understand if presented as a plot. The plot types used are time series plots, stacked bar plots, distribution curves, lorenz curves and maps.

4.4.1 Order Data

4.4.1.1 Customer Data

The customer data of the order data can be viewed from two perspectives: One way is to use every single order row for the creation of the visualizations and therby create a weighted view on the data, in which customers that have bought more products are more respected. Another possibility is to display the customer information just for every unique customer in the order data and disregard the number of orders a customer made. The following plottings will show both of this perspectives.

Firstly, we genereted some density curves for the attribute age to visualize the distribution of the customership. Also, we differentiated between the genders for this. The curves all show a fast rise in customership for the ages 20 to 40, which decreases slowly after a peak at about 35 to 40. When comparing the weighted and the normal graph, a slightly shift of the curve can be observed. This indicates that in general older people tend to order more. This is especially relevant for an older male customership, which is shown by the peak at about 55 in the male curve of the weighted density plot. Additionally, a less wide peak can be observed for the male customers, which has its maximum shortly before the age of 40. Furthermore, it should be regarded, that the gender plot shows a prozentual curve for each gender, but the ratio of customers differs by gender.

In order to show the customer location, we generated a map of the United States, which shows the cities with the highest customer numbers. From this it can be observed that the west coast of the US orders in general the most. The difference between the weighted and the other customer graph shows that the customership from the large cities like New York or San Franscisco tends to have high order numbers since the large circles at this cities disappear, when we look at the non-weighted graph.

To visualize the importance of different areas, we created a heatmap for the different US States. Here we can see that California and New York are the most important customer states. This is probably highly influenced by the big cities in this states. Here, the distorting factor of population density must be taken into account.

4.4.1.2 Product Data

For the product Data, it has to be regarded that we here visualize the orders and thereby the characteristics of more often bought products have a higher influence. Because of this, the following plots should be seen as popularity graphs of different product attributes.

The following stacked bar plot shows the stock types for the top brands and manufactures. Most of them have a replenishable assortment. The biggest brands and manufacturers seem to have mixed stock types, which contain partly seasonal products.

The lorenz curve of the products shows that only a quarter of the whole product quantity is responsible for about 75% of the ordered products. This indicates that some products have a very high popularity.

The manufacturer lorenz curve shows a real high ratio on orders for the biggest manufacturer, which leads to the assumptions that the popularity of some manufacturers is even higher than the product popularity.

The brand lorenz curve looks a little more flat, indicating a higher relevance for the manufacturer than for the brand.

4.4.1.3 Payment Data

The stacked bar plot for credit card brands shows the ratio of each brand on premium and on upscale cards. The card brand AMEX is here noticable with a relative high ratio of premium and upscale cards.

4.4.1.4 Order Process Data

For the order process data some graphs referring to the order amount and price were created. It has to be respected again that the order data can deform the graphs, by example it is imaginable that the customers tend to buy the cheaper products and therefore the average product price seems lower.

The density curve for the discount amount shows 3 peaks: The first one has a medium height and is around a discount of 0%, the second one is around the 10% mark and is pretty large, whereas the last peak is at a discount of 50% and is rather low. This might indicate that customers buy rather targeted than randomly.

A density curve reflecting the order time is shown by the next plot. As to be expected the order amount goes down through the night. During the day the ordering is relative stable with some small peaks at 10 a.m. and in the afternoon. The activity in the afternoon can be explained by the average working hours, which mostly allow people only to spend time on online shopping at the afternoon and evening.

Next, there is displayed a graph, which summarizes some information on order behaviour. From this visulalization we can learn that the online shop sells products of a low price segments, but usually receives orders contaning a rather high amount (peak at 3-15) of articles. The different history plots on the right indicate a high activity in the first half of February, which drops in the beginning of the second month half and than slowly rises again. The first half could possibly be explained through customers buying presents for the valentines day (14th February).

4.4.2 Clickstream Data

5 Evaluating Recomendation Systems

In addition to analyzing the order and clickstream data, we have to analyze the performance of different recommendation models. The performance of three different recommendation systems was measured:

  • A profit based recommendation system: This one recommends products that shall fit the taste of the subject, but also generate high revenue shares.
  • A ranking based recommendation system: This one recommends best performing products according to their sales rank.
  • A random recommendation system: This one was used as a baseline treatment.

The evaluation of the profit and ranking based recommendation systems is done using inference analysis, specifically using the computational paradigm instead of the mathematical one. One test for each of the two recommendation systems is carried out with the null hypothesis always being that the system does not cause different sales than a purely random recommendation system. During each test we randomize our sample data 1000 times, using either permutation or bootstrapping, and measure the p-value and confidence interval. The test statistic, we always use, is the difference in mean between the group using the profit or ranking based Recomendation system and the group using the random recommendation system. If the null hypothesis was true, then the test statistic value for our sample would not significantly differ from the distribution of the test statistic for our randomized data. Our default alpha for the confidence interval is 5%, but since we conduct a total of two tests, we have to apply the Bonferroni correction and adjust the alpha we specify for our confidence interval to 2.5%.

Before diving into the inference analysis itself, we have to reformat our data for the recommendation systems into a form that is suitable for inference analysis. We want to have a data frame in which one row equals one customer, who was exposed to a recommendation system. We use three columns:

  • Sales: The sales in Euro per customer.
  • Used_Profit_Oriented_recommendations: 1 if the customer was exposed to the profit oriented recommendation system, otherwise 0.
  • Used_Top_recommendations: 1 if the customer was exposed to the ranking based recommendation system, otherwise 0.

If the value in column Used_Top_recommendations and Used_Profit_Oriented_recommendations is 0, it means that the random recommendation system was used.

In the following code block we reformat the data and print the first 10 rows of the reformatted table.

Sales_in_EUR Used_Profit_Oriented_recommendations Used_Top_recommendations
8.50 0 0
20.00 1 0
16.00 0 1
8.50 0 0
17.75 1 0
19.75 0 1
18.75 0 0
21.75 1 0
19.75 0 1
17.75 0 0

5.1 The Profit Oriented Recommendation System

Now, we do an inference analysis for the profit oriented recommendation system. Firstly, we look at the p-value and the corresponding plot:

## [1] "p-value = 0"

The plot shows us the distribution of the test statistic for the 1000 randomized samples. The test statistic value for our sample is represented by a black line. The two-sided p-value regions are marked by a grey background. If our null hypothesis were true, then the test statistic value of our sample would be somewhere in the distribution of the test statistic for the randomized samples. Every test statistic value for a randomized sample, which lies in the p-value region, increases the p-value.

As we can see the test statistic value of our sample is pretty far away from the test statistic values of the randomized samples. This already shows, without looking at the p-value itself, that the profit oriented recommendation system causes a significant difference in the sales in Euro. The value of the p-value is 0, which reaffirms our interpretation of the plot.

Now, we look at the confidence interval:

2.5% 97.5%
3.291107 4.399673

As we can see, there is a 95% chance that for the global population customers, who get profit oriented recommendations, spend on average 3.29€-4.4€ more than people who get random recommendations.

5.2 The Ranking Based Recommendation System

Now we do an inference analysis for the ranking based recommendation system. Firstly, we look at the p-value and the corresponding plot:

## [1] "p-value = 0.076"

In this plot quite some portions of the distribution of the test statistic for our random samples lie in the p-value zone. This is also shown by the p-value 0.076, which greater than 0.025. This means that for our alpha = 0.05 the effect of the ranking based recommendation system is statistically insignificant.

Now we look at the confidence interval:

2.5% 97.5%
-0.1074255 1.279533

Since the confidence interval includes the value 0, it shows us that the effect is statistically insignificant.

5.3 Summary

To sum it up, the company should use the profit oriented recommendation system, since out of the two tested systems it causes the largest increase in sales. The ranking based recommendation system does not cause any statistically relevant difference in sales. However if the effect was something else than sales in Euro per person, then the results could be different. The company should ask itself if the goal of their recommendation system should really be to increase sales. Maybe at some point in time it could introduce a subscription business model, similar to that of Amazon. In that case it might also want to increase the percentage of customers that have a subscription.

6 Creating a Model

7 Summary

8 Appendix